Abstract



We consider an adversarial online learning setting where a decision maker can 
choose an action in every stage of the game. In addition to observing the reward 
of the chosen action, the decision maker gets side observations on the reward he 
would have obtained had he chosen some of the other actions. The observation 
structure is encoded as a graph, where node i is linked to node j if sampling i pro- 
vides information on the reward of j. This setting naturally interpolates between 
the well-known "experts" setting, where the decision maker can view all rewards, 
and the multi-armed bandits setting, where the decision maker can only view the 
reward of the chosen action. We develop practical algorithms with provable regret 
guarantees, which depend on non-trivial graph-theoretic properties of the infor- 
mation feedback structure. We also provide partially-matching lower bounds. 



1 Introduction 

One of the most basic learning settings studied in the online learning framework is learning from 
experts. In its simplest form, we assume that each round t, the learning algorithm must choose one 
of k possible actions, which can be interpreted as following the advice of one of k "experts".^1 At the
end of the round, the performance of all actions, measured here in terms of some reward, is revealed. 
This process is iterated for T rounds, and our goal is to minimize the regret, namely the difference 
between the total reward of the single best action in hindsight, and our own accumulated reward. 
We follow the standard online learning framework, in which nothing whatsoever can be assumed 
on the process generating the rewards, and they might even be chosen by an adversary who has full 
knowledge of our learning algorithm. 

A crucial assumption in this setting is that we get to see the rewards of all actions at the end of each 
round. However, in many real-world scenarios, this assumption is unrealistic. A canonical example 
is web advertising, where at any timepoint one may choose only a single ad (or small number of ads) 
to display, and observe whether it was clicked, but not whether other ads would have been clicked 
or not if presented to the user. This partial information constraint has led to a flourishing literature 
on multi-armed bandits problems, which model the setting where we can only observe the reward 
of the action we chose. While this setting has been long studied under stochastic assumptions, the 
landmark paper [4] showed that this setting can also be dealt with under adversarial conditions, 
making the setting comparable to the experts setting discussed above. The price in terms of the 
provable regret is usually an extra $\sqrt{k}$ multiplicative factor in the bound. The intuition for this factor
has long been that in the bandit setting, we only get "1/k of the information" obtained in the expert 
setting (as we observe just a single reward rather than k). While the bandits setting received much 
theoretical interest, it has also been criticized for not capturing additional side-information we often
have on the rewards of the different actions. This has led to studying richer settings, which make
various assumptions on the relationship between the rewards; see below for more details.

1 The more general setup, which is beyond the scope of this paper, considers k experts providing advice for
choosing among n actions, where in general n ≠ k [4].

In this paper, we formalize and initiate a study on a range of settings that interpolates between the 
bandits setting and the experts setting. Intuitively, we assume that after choosing some action i, and 
obtaining the action's reward, we observe not just action i's reward (as in the bandit setting), and 
not the rewards of all actions (as in the experts setting), but rather some (possibly noisy) information 
on a subset of the other actions. This subset may depend on action i in an arbitrary way, and may 
change from round to round. This information feedback structure can be modeled as a sequence 
of directed graphs $G_1, \ldots, G_T$ (one per round t), so that an edge from action i to action j implies
that by choosing action i, "sufficiently good" information is revealed on the reward of action j as
well. The case of $G_t$ being the complete graph corresponds to the experts setting. The case of $G_t$
being the empty graph corresponds to the bandit setting. The broad scenario of arbitrary graphs in
between the two is the focus of our study.

As a motivating example, consider the problem of web advertising mentioned earlier. In the standard 
multi-armed bandits setting, we assume that we have no information whatsoever on whether undis- 
played ads would have been clicked on. However, in many cases, we do have some side-information. 
For instance, if two ads i,j are for similar vacation packages in Hawaii, and ad i was displayed and 
clicked on by some user, it is likely that the other ad j would have been clicked on as well. In 
contrast, if ad i is for running shoes, and ad j is for wheelchair accessories, then a user who clicked 
on one ad is unlikely to click on the other. This sort of side-information can be better captured in
our setting. 

As another motivating example, consider a sensor network where each sensor collects data from a 
certain geographic location. Each sensor covers an area that may overlap the area covered by other 
sensors. At every stage a centralized controller activates one of the sensors and receives input from 
it. The value of this input is modeled as the integral of some "information" in the covered area. 
Since the area covered by each of the sensors overlaps the area covered by other sensors, the reward 
obtained when choosing sensor i provides an indication of the reward that would have been obtained 
when sampling sensor j. A related example comes from ultra wideband communication networks, 
where every agent can select which channel to use for transmission. When using a channel, the 
agent senses if the transmission was successful, and also receives some indication of the noise level 
in other channels that are in adjacent frequency bands 0. 

Our results portray an interesting picture, with the attainable regret depending on non-trivial prop- 
erties of these graphs. We provide two practical algorithms with regret guarantees: the ExpBan 
algorithm that is based on a combination of existing methods, and the more fundamentally novel 
ELP algorithm that has superior guarantees. We also study lower bounds for our setting. In the 
case of undirected graphs, we show that the information-theoretically attainable regret is precisely 
characterized by the average independence number (or stability number) of the graph, namely the 
size of its largest independent set. For the case of directed graphs, we obtain a weaker regret which 
depends on the average clique-partition number of the graphs. More specifically, our contributions 
are as follows: 

• We formally define and initiate a study of the setting that interpolates between learning with
expert advice (with $O(\sqrt{\log(k)T})$ regret), which assumes that all rewards are revealed, and
the multi-armed bandits setting (with $O(\sqrt{kT})$ regret), which assumes that only the reward of
the action selected is revealed. We provide an answer to a range of models in between.

• The framework we consider assumes that by choosing each action, other than just obtaining 
that action's reward, we can also observe some side-information about the rewards of other 
actions. We formalize this as a graph $G_t$ over the actions, where an edge between two
actions means that by choosing one action, we can also get a "sufficiently good" estimate
of the reward of the other action. We consider both the case where $G_t$ changes at each
round t, as well as the case that $G_t = G$ is fixed throughout all rounds.

• We establish upper and lower bounds on the achievable regret, which depend on two
combinatorial properties of $G_t$: its independence number $\alpha(G_t)$ (namely, the largest number
of nodes without edges between them), and its clique-partition number $\bar{\chi}(G_t)$ (namely, the
smallest number of cliques into which the nodes can be partitioned).
• We present two practical algorithms to deal with this setting. The first algorithm, called
ExpBan, combines existing algorithms in a natural way, and applies only when $G_t = G$
is fixed at all T rounds. Ignoring computational constraints, the algorithm achieves a
regret bound of $O(\sqrt{\bar{\chi}(G)\log(k)T})$. With computational constraints, its regret bound is
$O(\sqrt{c\log(k)T})$, where c is the size of the minimal clique partition one can efficiently find
for G. However, note that for general graphs, it is NP-hard to find a clique partition for
which $c = O(k^{1-\epsilon})$ for any $\epsilon > 0$.

• The second algorithm, called ELP, is an improved algorithm, which can handle graphs
which change between rounds. For undirected graphs, where sampling i gives an observation
on j and vice versa, it achieves a regret bound of $O(\sqrt{\log(k)\sum_{t=1}^{T}\alpha(G_t)})$. For
directed graphs (where the observation structure is not symmetric), our regret bound is
at most $O(\sqrt{\log(k)\sum_{t=1}^{T}\bar{\chi}(G_t)})$. Moreover, the algorithm is computationally efficient.
This is in contrast to the ExpBan algorithm, which in the worst case, cannot efficiently
achieve regret significantly better than $O(\sqrt{k\log(k)T})$.

• For the case of a fixed graph $G_t = G$, we present an information-theoretic $\Omega(\sqrt{\alpha(G)T})$
lower bound on the regret, which holds regardless of computational efficiency.

• We present some simple synthetic experiments, which demonstrate that the potential ad- 
vantage of the ELP algorithm over other approaches is real, and not just an artifact of our 
analysis. 



1.1 Related Work 



The standard multi-armed bandits problem assumes no relationship between the actions. Quite a few 
papers studied alternative models, where the actions are endowed with a richer structure. However, 
in the large majority of such papers, the feedback structure is the same as in the standard multi-armed 
bandits. Examples include models where the actions' rewards are assumed to be drawn from a statistical
distribution, with correlations between the actions, and models where the actions' rewards are
assumed to satisfy some Lipschitz continuity property with respect to a distance measure between
the actions.

In terms of other approaches, the combinatorial bandits framework [7] considers a setting slightly
similar to ours, in that one chooses and observes the rewards of some subset of actions. However, it 
is crucially assumed that the reward obtained is the sum of the rewards of all actions in the subset. 
In other words, there is no separation between earning a reward and obtaining information on its 
value. Another relevant approach is partial monitoring, which is a very general framework for 
online learning under partial feedback. However, this generality comes at the price of tractability for 
all but specific cases, which do not include our model. 

Our work is also somewhat related to the contextual bandit problem (e.g., [9, 10]), where the standard
multi-armed bandits setting is augmented with some side-information provided in each round,
which can be used to determine which action to pick. While we also consider additional side- 
information, it is in a more specific sense. Moreover, our goal is still to compete against the best 
single action, rather than some set of policies which use this side-information. 



2 Problem Setting 

Let $[k] = \{1, \ldots, k\}$ and $[T] = \{1, \ldots, T\}$. We consider a set of actions $1, 2, \ldots, k$. Choosing
an action i at round t results in receiving a reward $g_i(t)$, which we shall assume without loss of
generality to be bounded in [0, 1]. Following the standard adversarial framework, we make no
assumptions whatsoever about how the rewards are selected, and they might even be chosen by an
adversary. We denote our choice of action at round t as $i_t$. Our goal is to minimize regret with
respect to the best single action in hindsight, namely

$$\max_{i \in [k]} \sum_{t=1}^{T} g_i(t) - \sum_{t=1}^{T} g_{i_t}(t).$$
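The regret definition above can be made concrete with a few lines of Python. This is a minimal sketch for illustration only; the function name `regret` and the list-of-lists reward layout are assumptions introduced here, not part of the paper.

```python
def regret(rewards, chosen):
    """rewards[t][i] is the reward g_i(t); chosen[t] is the action i_t played at round t."""
    k = len(rewards[0])
    # Total reward of the single best action in hindsight.
    best_fixed = max(sum(row[i] for row in rewards) for i in range(k))
    # Total reward actually collected by the learner.
    earned = sum(row[a] for row, a in zip(rewards, chosen))
    return best_fixed - earned


# Three rounds, two actions: action 1 always pays 1, the learner plays 0, 1, 0.
rewards = [[0.0, 1.0], [1.0, 1.0], [0.0, 1.0]]
chosen = [0, 1, 0]
print(regret(rewards, chosen))  # 2.0
```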
Algorithm 1 The ExpBan Algorithm
Input: neighborhood sets $\{N_i(t)\}_{i \in [k]}$.

Split the graph induced by the neighborhood sets into c cliques (c ≤ k as small as possible)
For each clique, define a "meta-action" to be a standard experts algorithm over the actions in the
clique

Run a multi-armed-bandits algorithm over the c meta-actions



For simplicity, we will focus on a finite-horizon setting (where the number of rounds T is known in
advance), on regret bounds which hold in expectation, and on oblivious adversaries, namely that the
reward sequence $g_i(t)$ is unknown but fixed in advance (see Sec. 8 for more on this issue).

Each round t, the learning algorithm chooses a single action $i_t$. In the standard multi-armed bandits
setting, this results in $g_{i_t}(t)$ being revealed to the algorithm, while $g_j(t)$ remains unknown for any
$j \neq i_t$. In our setting, we assume that by choosing an action i, other than getting $g_i(t)$, we also
get some side-observations about the rewards of the other actions. Formally, we assume that one
receives $g_i(t)$, and for some fixed parameter b is able to construct unbiased estimates $\hat{g}_j(t)$ for all
actions j in some subset of [k], such that $\mathbb{E}[\hat{g}_j(t) \mid \text{action } i \text{ chosen}] = g_j(t)$ and $\Pr(|\hat{g}_j(t)| \le b) = 1$.
For any action j, we let $N_j(t)$ be the set of actions for which we can get such an estimate $\hat{g}_j(t)$ on the
reward of action j. This is essentially the "neighborhood" of action j, which receives sufficiently
good information (as parameterized by b) on the reward of action j. We note that j is always a
member of $N_j(t)$, and moreover, $N_j(t)$ may be larger or smaller depending on the value of b we choose.
We assume that $N_j(t)$ for all j, t are known to the learner in advance.

Intuitively, one can think of this setting as a sequence of graphs, one graph per round t, which 
captures the information feedback structure between the actions. Formally, we define $G_t$ to be a
graph on the k nodes 1, . . . , k, with an edge from node i to node j if and only if $j \in N_i(t)$. In the
case that $j \in N_i(t)$ if and only if $i \in N_j(t)$, for all i, j, we say that $G_t$ is undirected. We will use
this graph viewpoint extensively in the remainder of the paper. 
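To illustrate the graph viewpoint, the following sketch builds the edge set from neighborhood sets and checks the symmetry condition for undirectedness. The helper names `build_graph` and `is_undirected`, and the dictionary representation of the neighborhoods, are hypothetical choices made for this example.

```python
def build_graph(neighborhoods):
    """neighborhoods[i] is the set N_i(t); returns the directed edges (i, j) with j != i."""
    return {(i, j) for i, nbrs in neighborhoods.items() for j in nbrs if j != i}


def is_undirected(neighborhoods):
    """True iff j in N_i(t) exactly when i in N_j(t), for all i, j."""
    return all(i in neighborhoods[j] for i, nbrs in neighborhoods.items() for j in nbrs)


experts = {i: {0, 1, 2} for i in range(3)}   # complete graph: experts setting
bandit = {i: {i} for i in range(3)}          # empty graph: bandit setting
one_way = {0: {0, 1}, 1: {1}, 2: {2}}        # action 0 observes action 1, not vice versa

print(is_undirected(experts), is_undirected(bandit), is_undirected(one_way))
```

The first two structures are symmetric, while the third is a genuinely directed observation structure.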



3 The ExpBan Algorithm 

We begin by presenting the ExpBan algorithm (see Algorithm 1 above), which builds on existing
algorithms to deal with our setting, in the special case where the graph structure remains fixed 
throughout the rounds - namely, Gt = G for all t. The idea of the algorithm is to split the actions 
into c cliques, such that choosing an action in a clique reveals unbiased estimates of the rewards of 
all the other actions in the clique. By running a standard experts algorithm (such as the exponen- 
tially weighted forecaster - see [6, Chapter 2]), we can get low regret with respect to any action in 
that clique. We then treat each such expert algorithm as a meta-action, and run a standard bandits 
algorithm (such as the EXP3 [4]) over these c meta-actions. We denote this algorithm as ExpBan, 
since it combines an experts algorithm with a bandit algorithm. 

The following result provides a bound on the expected regret of the algorithm. The proof appears in 
the appendix. 

Theorem 1. Suppose $G_t = G$ is fixed for all T rounds. If we run ExpBan using the exponentially
weighted forecaster and the EXP3 algorithm, then the expected regret is bounded as follows:^2

$$\max_{j \in [k]} \sum_{t=1}^{T} g_j(t) - \mathbb{E}\left[\sum_{t=1}^{T} g_{i_t}(t)\right] \le 4b\sqrt{c\log(k)T}. \tag{1}$$

For the optimal clique partition, we have $c = \bar{\chi}(G)$, the clique-partition number of G.

It is easily seen that $\bar{\chi}(G)$ is a number between 1 and k. The case $\bar{\chi}(G) = 1$ corresponds to G being
a clique, namely, that choosing any action allows us to estimate the rewards of all other actions.
This corresponds to the standard experts setting, in which case the algorithm attains the optimal
$O(\sqrt{\log(k)T})$ regret. At the other extreme, $\bar{\chi}(G) = k$ corresponds to G being the empty graph,
namely, that choosing any action only reveals the reward of that action. This corresponds to the
standard bandit setting, in which case the algorithm attains the standard $O(\sqrt{\log(k)kT})$ regret. For
general graphs, our algorithm interpolates between these regimes, in a way which depends on $\bar{\chi}(G)$.

2 Using more sophisticated methods, it is now known that the $\log(k)$ factor can be removed (e.g., [3]).
However, we will stick with this slightly less tight analysis for simplicity.

While being simple and using off-the-shelf components, the ExpBan algorithm has some disadvantages.
First of all, for a general graph G, it is NP-hard to find $c \le O(k^{1-\epsilon})$ for any $\epsilon > 0$. (This
follows from [12] and the fact that the clique-partition number of G equals the chromatic number
of its complement.) Thus, with computational constraints, one cannot hope to obtain a bound better
than $O(\sqrt{kT})$. That being said, we note that this is only a worst-case result, and in practice or for
specific classes of graphs, computing a good clique partition might be relatively easy. A second
disadvantage of the algorithm is that it is not applicable for an observation structure that changes
with time.
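Since finding a minimum clique partition is hard in general, a practical implementation of the first step of ExpBan would likely rely on a heuristic. The sketch below is one such heuristic, a simple first-fit greedy pass; it is an assumption made for illustration, not the method used by the authors, and it carries no approximation guarantee. The function name `greedy_clique_partition` and the adjacency-dictionary input are hypothetical.

```python
def greedy_clique_partition(k, adj):
    """First-fit heuristic for splitting nodes 0..k-1 into cliques.

    adj[i] is the set of neighbors of node i in an undirected graph.
    Returns a list of disjoint cliques (sets) that together cover every node.
    """
    cliques = []
    for v in range(k):
        for clique in cliques:
            # v can join a clique only if it is linked to every current member.
            if all(u in adj[v] for u in clique):
                clique.add(v)
                break
        else:
            cliques.append({v})
    return cliques


# Path graph 0 - 1 - 2 - 3: the heuristic returns the two cliques {0, 1} and {2, 3}.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(greedy_clique_partition(4, path))
```

The number of cliques returned plays the role of c in Theorem 1; it can be much larger than the optimum on adversarial inputs.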

4 The ELP Algorithm 

We now turn to present the ELP algorithm (which stands for "Exponentially-weighted algorithm 
with Linear Programming"). Like all multi-armed bandits algorithms, it is based on a tradeoff 
between exploration and exploitation. However, unlike standard algorithms, the exploration com- 
ponent is not uniform over the actions, but is chosen carefully to reflect the graph structure at each 
round. In fact, the optimal choice of the exploration requires us to solve a simple linear program, 
hence the name of the algorithm. Below, we present the pseudo-code as well as a couple of theorems 
that bound the expected regret of the algorithm under appropriate parameter choices. The proofs of 
the theorems appear in the appendix. The first theorem concerns the symmetric observation case, 
where if choosing action i gives information on action j, then choosing action j must also give in- 
formation on i. The second theorem concerns the general case. We note that in both cases the graph 
Gt may change arbitrarily in time. 



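Algorithm 2 The ELP Algorithm

Input: $\beta$, $\{\gamma(t)\}_{t \in [T]}$, $\{s_i(t)\}_{i \in [k], t \in [T]}$, neighborhood sets $\{N_i(t)\}_{i \in [k], t \in [T]}$.
$\forall j \in [k]$: $w_j(1) := 1/k$.
for $t = 1, \ldots, T$ do
    $\forall i \in [k]$: $p_i(t) := (1 - \gamma(t)) \frac{w_i(t)}{\sum_{l=1}^{k} w_l(t)} + \gamma(t) s_i(t)$
    Choose action $i_t$ with probability $p_{i_t}(t)$, and receive reward $g_{i_t}(t)$
    Compute $\hat{g}_j(t)$ for all $j \in N_{i_t}(t)$
    For all $j \in [k]$, let $\tilde{g}_j(t) = \frac{\hat{g}_j(t)}{\sum_{l \in N_j(t)} p_l(t)}$ if $i_t \in N_j(t)$, and $\tilde{g}_j(t) = 0$ otherwise.
    $\forall j \in [k]$: $w_j(t+1) = w_j(t) \exp(\beta \tilde{g}_j(t))$
end for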
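As an illustration, one round of Algorithm 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes an undirected graph, treats the true reward of each observed action as its own (noiseless) estimate, and takes the exploration distribution `s` and the parameters `gamma` and `beta` as given. The function name `elp_round` and its argument layout are hypothetical.

```python
import math
import random


def elp_round(w, s, gamma, beta, nbrs, rewards, rng):
    """One round of the ELP update on an undirected graph.

    w: current weights; s: exploration distribution; nbrs[j]: the set N_j(t), containing j;
    rewards[j]: reward of action j this round, used as its own noiseless estimate.
    Returns the chosen action, the sampling distribution, and the updated weights.
    """
    k = len(w)
    total = sum(w)
    # Mix the exponential-weights distribution with the exploration distribution.
    p = [(1 - gamma) * w[i] / total + gamma * s[i] for i in range(k)]
    chosen = rng.choices(range(k), weights=p)[0]
    new_w = []
    for j in range(k):
        if chosen in nbrs[j]:
            # Importance weighting: divide by the probability that j is observed.
            g_tilde = rewards[j] / sum(p[l] for l in nbrs[j])
        else:
            g_tilde = 0.0
        new_w.append(w[j] * math.exp(beta * g_tilde))
    return chosen, p, new_w


rng = random.Random(0)
k = 3
w = [1.0 / k] * k
s = [1.0 / k] * k
full_graph = [{0, 1, 2} for _ in range(k)]
chosen, p, w = elp_round(w, s, gamma=0.1, beta=0.05, nbrs=full_graph,
                         rewards=[1.0, 0.0, 0.0], rng=rng)
```

On the complete graph every action is observed with probability one, so the importance weights are trivial and the update reduces to the usual exponentially weighted forecaster.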



4.1 Undirected Graphs 

The following theorem provides a regret bound for the algorithm, as well as appropriate parameter 
choices, in the case of undirected graphs. Later on, we will discuss the case of directed graphs. In 
a nutshell, the theorem shows that the regret bound depends on the average independence number
$\alpha(G_t)$ of each graph $G_t$ - namely, the size of its largest independent set.

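Theorem 2. Suppose that for all t, $G_t$ is an undirected graph. Suppose we run Algorithm 2 using
some $\beta \in (0, 1/(2bk))$, and choosing

$$\{s_i(t)\}_{i \in [k]} = \operatorname*{argmax}_{\forall i\; s_i(t) \ge 0,\ \sum_i s_i(t) = 1}\ \min_{j \in [k]} \sum_{l \in N_j(t)} s_l(t),$$

(which can be easily done via linear programming) and $\gamma(t) = \beta b / \min_{j \in [k]} \sum_{l \in N_j(t)} s_l(t)$. Then
it holds for any fixed action j that

$$\sum_{t=1}^{T} g_j(t) - \mathbb{E}\left[\sum_{t=1}^{T} g_{i_t}(t)\right] \le 3\beta b^2 \sum_{t=1}^{T} \alpha(G_t) + \frac{\log(k)}{\beta}. \tag{2}$$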
If we choose $\beta = \sqrt{\log(k) / (3b^2 \sum_{t=1}^{T} \alpha(G_t))}$, then the bound equals

$$2b\sqrt{3\log(k)\sum_{t=1}^{T}\alpha(G_t)}. \tag{3}$$



Comparing Thm. 2 with Thm. 1, we note that for any graph $G_t$, its independence number $\alpha(G_t)$
lower bounds its clique-partition number $\bar{\chi}(G_t)$. In fact, the gap between them can be very large
(see Sec. 6). Thus, the attainable regret using the ELP algorithm is better than the one attained by the
ExpBan algorithm. Moreover, the ELP algorithm is able to deal with time-changing graphs, unlike
the ExpBan algorithm.

If we take worst-case computational efficiency into account, things are slightly more involved. 
For the ELP algorithm, the optimal value of $\beta$, needed to obtain Eq. (3), requires knowledge of
$\sum_{t=1}^{T} \alpha(G_t)$, but computing or approximating the $\alpha(G_t)$ is NP-hard in the worst case. However,
there is a simple fix: we create $\lceil \log(k) \rceil$ copies of the ELP algorithm, where copy i assumes that
$\sum_{t=1}^{T} \alpha(G_t)$ equals $2^{i-1}$. Note that one of these values must be wrong by a factor of at most 2,
so the regret of the algorithm using that value would be larger by a factor of at most 2. Of course,
the problem is that we don't know in advance which of those $\lceil \log(k) \rceil$ copies is the best one. But
this can be easily solved by treating each such copy as a "meta-action", and running a standard
multi-armed bandits algorithm (such as EXP3) over these $\lceil \log(k) \rceil$ actions. Note that the same idea
was used in the construction of the ExpBan algorithm. Since there are $\lceil \log(k) \rceil$ meta-actions, the
additional regret incurred is $O(\sqrt{\log^2(k)T})$. So up to logarithmic factors in k, we get the same
regret as if we could actually compute the optimal value of $\beta$.
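For reference, the tuned value of $\beta$ comes from balancing the two terms of Eq. (2). The short derivation below uses the shorthand $A = \sum_{t=1}^{T} \alpha(G_t)$ and a guessed value $A'$; both symbols are introduced here for convenience and are not in the original text. The derivation ignores the constraint $\beta < 1/(2bk)$.

```latex
\begin{aligned}
f(\beta) &= 3\beta b^2 A + \frac{\log(k)}{\beta}, \qquad
f'(\beta) = 3b^2 A - \frac{\log(k)}{\beta^2} = 0
\;\Longrightarrow\;
\beta^\star = \sqrt{\frac{\log(k)}{3b^2 A}}, \qquad
f(\beta^\star) = 2b\sqrt{3\log(k)\,A}. \\[4pt]
\text{With a guess } A' \in [A, 2A]: \quad
f\!\left(\sqrt{\tfrac{\log(k)}{3b^2 A'}}\right)
&= b\sqrt{3\log(k)}\left(\frac{A}{\sqrt{A'}} + \sqrt{A'}\right)
\le b\sqrt{3\log(k)} \cdot \frac{3}{\sqrt{2}}\sqrt{A}
\approx 1.06\, f(\beta^\star).
\end{aligned}
```

So a guess that overestimates $A$ by a factor of at most 2 costs only a small constant factor in the bound, comfortably within the factor of 2 mentioned above.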



4.2 Directed Graphs 



So far, we assumed that the graphs we are dealing with are all undirected. However, a natural 
extension of this setting is to assume a directed graph, where choosing an action i may give us 
information on the reward of action j, but not vice-versa. It is readily seen that the ExpBan algorithm 
would still work in this setting, with the same guarantee. For the ELP algorithm, we can provide the 
following guarantee: 

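Theorem 3. Under the conditions of Thm. 2 (with the relaxation that the graphs $G_t$ may be directed),
it holds for any fixed action j that

$$\sum_{t=1}^{T} g_j(t) - \mathbb{E}\left[\sum_{t=1}^{T} g_{i_t}(t)\right] \le 3\beta b^2 \sum_{t=1}^{T} \bar{\chi}(G_t) + \frac{\log(k)}{\beta}, \tag{4}$$

where $\bar{\chi}(G_t)$ is the clique-partition number of $G_t$. If we choose $\beta = \sqrt{\log(k) / (3b^2 \sum_{t=1}^{T} \bar{\chi}(G_t))}$, then
the bound equals

$$2b\sqrt{3\log(k)\sum_{t=1}^{T}\bar{\chi}(G_t)}. \tag{5}$$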



Note that this bound is weaker than the one of Thm. 2, since $\alpha(G_t) \le \bar{\chi}(G_t)$ as discussed earlier. We
do not know whether this bound (relying on the clique-partition number) is tight, but we conjecture
that the independence number, which appears to be the key quantity in undirected graphs, is not the
correct combinatorial measure for the case of directed graphs.^3 In any case, we note that even with
the weaker bound above, the ELP algorithm still seems superior to the ExpBan algorithm, in the
sense that it allows us to deal with time-changing graphs, and that an explicit clique decomposition
of the graph is not required. Also, we again have the issue of $\beta$, which is determined by a quantity
which is NP-hard to compute, i.e. $\bar{\chi}(G_t)$. However, this can be circumvented using the same trick
discussed in the context of undirected graphs.

3 It is possible to construct examples where the analysis of the ELP algorithm necessarily leads to an
$O(\sqrt{k\log(k)T})$ bound, even when the independence number is 1.
5 Lower Bound



The following theorem provides a lower bound on the regret in terms of the independence number
$\alpha(G)$, for a constant graph $G_t = G$.

Theorem 4. Suppose $G_t = G$ for all t, and that actions which are not linked in G get no
side-observations whatsoever between them. Then there exists a (randomized) adversary strategy
such that for every $T \ge 374\alpha(G)^3$ and any learning strategy, the expected regret is at least
$0.06\sqrt{\alpha(G)T}$.

A proof is provided in the appendix. The intuition of the proof is that if the graph G has $\alpha(G)$
independent vertices, then an adversary can make this problem as hard as a standard multi-armed
bandits problem, played on $\alpha(G)$ actions. Using a known lower bound of $\Omega(\sqrt{nT})$ for multi-armed
bandits on n actions, our result follows.^4

For constant undirected graphs, this lower bound matches the regret upper bound for the ELP
algorithm (Thm. 2) up to logarithmic factors. For directed graphs, the difference between them boils
down to the difference between $\bar{\chi}(G)$ and $\alpha(G)$. For many well-behaved graphs, this gap is rather
small. However, for general graphs, the difference can be huge - see the next section for details.

6 Examples 

Here, we briefly discuss some concrete examples of graphs G, and show how the regret performance
of our algorithms depends on their structure. An interesting issue to notice is the potential gap
between the performance of our algorithms, through the graph's independence number $\alpha(G)$ and
clique-partition number $\bar{\chi}(G)$.

First, consider the case where there exists a single action, such that choosing it reveals the rewards
of all the other actions. In contrast, choosing any of the other actions only reveals its own reward. At first
blush, it may seem that having such a "super-action", which reveals everything that happens in the
current round, should help us improve our regret. However, the independence number $\alpha(G)$ of such
a graph is easily seen to be $k - 1$. Based on our lower bound, we see that this "super-action" is
actually not helpful at all (up to negligible factors).
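The claim about the "super-action" graph can be checked by brute force on a small instance. The sketch below is for illustration only: the function name `independence_number` is hypothetical, node 0 plays the role of the super-action, and two nodes are treated as independent when there is no edge between them in either direction. The exhaustive search is exponential in k and is only meant for tiny graphs.

```python
from itertools import combinations


def independence_number(k, edges):
    """Size of the largest set of nodes with no edge (in either direction) between any two."""
    linked = {frozenset(e) for e in edges}
    for size in range(k, 0, -1):
        for subset in combinations(range(k), size):
            if all(frozenset(pair) not in linked for pair in combinations(subset, 2)):
                return size
    return 0


k = 6
# Node 0 is the "super-action": choosing it reveals the rewards of all other actions.
super_action_edges = [(0, j) for j in range(1, k)]
print(independence_number(k, super_action_edges))  # 5
```

The k - 1 ordinary actions form an independent set, which is why the lower bound of Thm. 4 still scales as in a bandit problem over k - 1 arms.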

Second, consider the case where the actions are endowed with some metric distance function, and
edge (i, j) is in G if and only if the distance between i, j is at most some fixed constant r. We can
think of each action i as being in the center of a sphere of radius r, such that the reward of action i is
propagated to every other action in that sphere. In this case, $\alpha(G)$ is essentially the number of
non-overlapping spheres we can pack in G. In contrast, $\bar{\chi}(G)$ is essentially the number of spheres we
need to cover G. Both numbers shrink rapidly as r increases, improving the regret of our algorithms.
However, the sphere covering size can be much larger than the sphere packing size. For example, if
the actions are placed as the elements in $\{0, 1/2, 1\}^n$, we use the $\ell_\infty$ metric, and $r \in (1/2, 1)$, it is
easily seen that the sphere packing number is just 1. In contrast, the sphere covering number is at
least $2^n = k^{\log_3(2)} \approx k^{0.63}$, since we need a separate sphere to cover every element in $\{0, 1\}^n$.

Third, consider the random Erdős-Rényi graph $G = G(k, p)$, which is formed by linking every
action i to every action j with probability p independently. It is well known that when p is a constant,
the independence number $\alpha(G)$ of this graph is only $O(\log(k))$, whereas the clique-partition
number $\bar{\chi}(G)$ is at least $\Omega(k/\log(k))$. This translates to a regret bound of $O(\sqrt{kT})$ for the ExpBan
algorithm, and only $O(\sqrt{\log^2(k)T})$ for the ELP algorithm. Such a gap would also hold for a
directed random graph.

7 Empirical Performance Gap between ExpBan and ELP 

In this section, we show that the gap between the performance of the ExpBan algorithm and the ELP 
algorithm can be real, and is not just an artifact of our analysis. 

4 We note that if the maximal degree of every node is bounded by d, it is possible to get the lower bound for
$T \ge \Omega(d^2\alpha(G))$ (as opposed to $T \ge \Omega(\alpha(G)^3)$); see the proof for details.
Figure 1: Experiments on random graphs.

To show this, we performed the following simple experiment: we created a random Erdős-Rényi
graph over 300 nodes, where each pair of nodes was linked independently with probability p.
Choosing any action results in observing the rewards of neighboring actions in the graph. The reward
of each action at each round was chosen randomly and independently to be 1 with probability 1/2
and 0 with probability 1/2, except for a single node, whose reward equals 1 with a higher probability
of 3/4. We then implemented the ExpBan and ELP algorithms in this setting, for T = 30,000. For
comparison, we also implemented the standard EXP3 multi-armed bandits algorithm [4], which 
doesn't use any side-observations. All the parameters were set to their theoretically optimal values. 
The experiment was repeated for varying p and over 10 independent runs. 
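The data-generating side of this experiment is easy to reproduce. The following sketch shows one way to draw the random observation graph and the per-round rewards; it is an assumption-laden illustration rather than the authors' code, and the function names `random_graph` and `draw_rewards`, as well as the choice of which node is the better one, are hypothetical.

```python
import random


def random_graph(k, p, rng):
    """Undirected Erdős-Rényi graph: adj[i] holds the neighbors of i, plus i itself."""
    adj = [{i} for i in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def draw_rewards(k, best, rng):
    """Bernoulli rewards: mean 3/4 for the distinguished action, 1/2 for all others."""
    return [1.0 if rng.random() < (0.75 if i == best else 0.5) else 0.0 for i in range(k)]


rng = random.Random(0)
adj = random_graph(300, 0.3, rng)
rewards = draw_rewards(300, best=0, rng=rng)
```

Each entry `adj[i]` can be used directly as the neighborhood set of action i in a simulation of the three algorithms.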

The results are displayed in Figure 1. The X-axis is the iteration number, and the Y-axis is the
mean payoff obtained so far, averaged over the 10 runs (the variance in the numbers was minuscule,
and therefore we do not report confidence intervals). For p = 0.05, the graph is rather empty, and
the advantage of using side observations is not large. As a result, all 3 algorithms perform roughly
the same for this choice of T. As p increases, the value of side-observations increases, and the
performance of our two algorithms, which utilize side-observations, improves over the standard
multi-armed bandits algorithm. Moreover, for intermediate values of p, there is a noticeable gap
between the performance of ExpBan and ELP. This is exactly the regime where the gap between
the clique-partition number (governing the regret bound of ExpBan) and the independence number
(governing the regret bound for the ELP algorithm) tends to be larger as well.^5 Finally, for large p,
the graph is almost complete, and the advantage of ELP over ExpBan becomes small again (since
most actions give information on most other actions).

8 Discussion 

In this paper, we initiated a study of a large family of online learning problems with side observa- 
tions. In particular, we studied the broad regime which interpolates between the experts setting and 
the bandits setting of online learning. We provided algorithms, as well as upper and lower bounds 
on the attainable regret, with a non-trivial dependence on the information feedback structure. 

There are many open questions that warrant further study. First, the upper and lower bounds essentially
match only in particular settings (i.e., in undirected graphs, where no side-observations whatsoever,
other than those dictated by the graph, are allowed). Can this gap be narrowed or closed?
Second, our lower bounds depend on a reduction which essentially assumes that the graph is constant
over time. We do not have a lower bound for changing graphs. Third, it remains to be seen
whether other online learning results can be generalized to our setting, such as learning with respect
to policies (as in EXP4 [4]) and obtaining bounds which hold with high probability. Fourth, the
model we have studied assumed that the observation structure is known. In many practical cases, 
the observation structure may be known just partially or approximately. Is it possible to devise 
algorithms for such cases? 

Acknowledgements. This research was supported in part by the Google Inter-university center for 
Electronic Markets and Auctions. 

5 Intuitively, this can be seen by considering the extreme cases - for a complete graph over k nodes, both
numbers equal 1, and for an empty graph over k nodes, both numbers equal k. For constant $p \in (0, 1)$, there is
a real gap between the two, as discussed in Sec. 6.
A Proofs



A.1 Proof of Thm. 1

Suppose we split the actions into c cliques $C_1, C_2, \ldots, C_c$. First, let us consider the expected regret
of the exponentially weighted forecaster run over any such clique. Denoting the actions of the clique
by $1, \ldots, n$, the forecaster works as follows: first, it initializes weights $w_1, \ldots, w_n$ to be 1. At each
round, it picks an action i with probability $w_i / \sum_{l=1}^{n} w_l$, receives the reward $g_i(t)$, and observes the
noisy reward value $\hat{g}_j(t)$ for each of the other actions. It then updates $w_i = w_i \exp(\beta \hat{g}_i(t))$ (for
some parameter $\beta \in (0, 1/b)$) for all $i = 1, \ldots, n$.
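The forecaster described above is short enough to write out in full. The following is a minimal sketch, assuming the estimates for all actions in the clique are supplied as a list of rows; the function name `exp_weights` and this input format are choices made for the example, not part of the paper.

```python
import math
import random


def exp_weights(estimates, beta, rng):
    """Exponentially weighted forecaster over n actions.

    estimates[t][i] is the (possibly noisy) reward estimate for action i at round t.
    Returns the list of actions chosen over the rounds.
    """
    n = len(estimates[0])
    w = [1.0] * n  # all weights start at 1
    chosen = []
    for row in estimates:
        # Sample an action with probability proportional to its weight.
        chosen.append(rng.choices(range(n), weights=w)[0])
        # Multiplicative update using the estimates of all actions.
        w = [wi * math.exp(beta * gi) for wi, gi in zip(w, row)]
    return chosen


T = 2000
estimates = [[0.0, 1.0] for _ in range(T)]  # action 1 is always better
beta = math.sqrt(math.log(2) / T)           # the tuning from Lemma 1 with b = 1
picks = exp_weights(estimates, beta, random.Random(0))
print(picks.count(1) / T)
```

With this tuning the forecaster should concentrate on the better action after an initial exploration phase whose length is on the order of $1/\beta$ rounds.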

The analysis of this algorithm is rather standard, with the main twist being that we only observe 
unbiased estimates of the rewards, rather than the actual reward. For completeness, we provide this 
analysis in the following lemma. 

Lemma 1. The expected regret of the forecaster described above, with respect to the actions in
clique $C_i$ and under the optimal choice of the parameter $\beta$, is at most $b\sqrt{\log(|C_i|)\,T}$.



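Proof. We define the potential function $W_t = \sum_{j=1}^{n} w_j(t)$, and get that

\[ \frac{W_{t+1}}{W_t} = \sum_{j=1}^{n} \frac{w_j(t)}{W_t} \exp(\beta \tilde{g}_j(t)). \]

For notational convenience, let $p_j(t) = w_j(t)/W_t$. Since $|\tilde{g}_j(t)| \le b$ and $\beta \le 1/b$, we have
$\beta \tilde{g}_j(t) \le 1$. Thus, we can use the inequality $\exp(x) \le 1 + x + x^2$ (which holds for any $x \le 1$), and
get the upper bound

\[ \sum_{j=1}^{n} p_j(t) \left(1 + \beta \tilde{g}_j(t) + \beta^2 \tilde{g}_j(t)^2\right) = 1 + \beta \sum_{j=1}^{n} p_j(t) \tilde{g}_j(t) + \beta^2 \sum_{j=1}^{n} p_j(t) \tilde{g}_j(t)^2. \]

Taking logarithms and using the fact that $\log(1 + x) \le x$, we get

\[ \log\left(\frac{W_{t+1}}{W_t}\right) \le \beta \sum_{j=1}^{n} p_j(t) \tilde{g}_j(t) + \beta^2 \sum_{j=1}^{n} p_j(t) \tilde{g}_j(t)^2. \]

Summing over all $t$, and canceling the resulting telescopic series, we get

\[ \log\left(\frac{W_{T+1}}{W_1}\right) \le \sum_{t=1}^{T} \left( \beta \sum_{j=1}^{n} p_j(t) \tilde{g}_j(t) + \beta^2 \sum_{j=1}^{n} p_j(t) \tilde{g}_j(t)^2 \right). \tag{6} \]

Also, for any fixed action $i$, we have

\[ \log\left(\frac{W_{T+1}}{W_1}\right) \ge \log\left(\frac{w_i(T+1)}{W_1}\right) = \beta \sum_{t=1}^{T} \tilde{g}_i(t) - \log(n). \tag{7} \]

Combining Eq. (6) with Eq. (7) and rearranging, we get

\[ \sum_{t=1}^{T} \tilde{g}_i(t) - \sum_{t=1}^{T} \sum_{j=1}^{n} p_j(t) \tilde{g}_j(t) \le \frac{\log(n)}{\beta} + \beta \sum_{t=1}^{T} \sum_{j=1}^{n} p_j(t) \tilde{g}_j(t)^2. \]

Taking expectations on both sides, and using the facts that $\mathbb{E}[\tilde{g}_j(t)] = g_j(t)$ for all $j, t$, and $|\tilde{g}_j(t)| \le b$
with probability 1, we get

\[ \sum_{t=1}^{T} g_i(t) - \sum_{t=1}^{T} \sum_{j=1}^{n} p_j(t) g_j(t) \le \frac{\log(n)}{\beta} + \beta b^2 T. \]

Thus, by picking $\beta = \sqrt{\log(n)/(b^2 T)}$, we get that the expected regret is at most $b\sqrt{\log(n)\,T}$. □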
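The two elementary inequalities used in the proof, $\exp(x) \le 1 + x + x^2$ for $x \le 1$ and $\log(1 + x) \le x$, can be spot-checked numerically. The sketch below is only a sanity check on a finite grid (the grid range and tolerance are arbitrary choices of ours), not a proof.

```python
import math

def check_elementary_inequalities(num_points=2001, lo=-5.0, hi=1.0):
    """Check exp(x) <= 1 + x + x^2 and log(1 + x) <= x on a grid in [lo, hi]."""
    xs = [lo + (hi - lo) * i / (num_points - 1) for i in range(num_points)]
    tol = 1e-12  # slack for floating-point rounding
    exp_ok = all(math.exp(x) <= 1 + x + x * x + tol for x in xs)
    # log(1 + x) is only defined for x > -1.
    log_ok = all(math.log1p(x) <= x + tol for x in xs if x > -1)
    return exp_ok, log_ok
```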
Now, we define each such forecaster (one per clique $C_i$) as a meta-action, and run the EXP3 algorithm
on the $c$ meta-actions. By the standard guarantee for this algorithm (see Corollary 3.2 in
[4]), the expected regret incurred by that algorithm with respect to any fixed meta-action is at most
$3b\sqrt{c \log(c) T}$. Combining this with Lemma 1, we get that the total expected regret of the ExpBan
algorithm with respect to any single action is at most

\[ \max_i \; b\sqrt{\log(|C_i|)\,T} + 3b\sqrt{c \log(c)\,T} \;\le\; b\sqrt{\log(k)\,T} + 3b\sqrt{c \log(k)\,T}, \]

which is at most $4b\sqrt{\log(k)\,c\,T}$ since $c \ge 1$.
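As a quick numerical illustration of the last chain of inequalities, the following sketch evaluates both sides for a given clique partition. The function name `expban_bound_terms` and the example clique sizes are hypothetical, chosen only to illustrate the arithmetic.

```python
import math

def expban_bound_terms(clique_sizes, b, T):
    """Return (within-clique term + EXP3 term, final 4b*sqrt(log(k) c T) bound)."""
    k = sum(clique_sizes)   # total number of actions
    c = len(clique_sizes)   # number of cliques (meta-actions)
    within = max(b * math.sqrt(math.log(s) * T) for s in clique_sizes)
    across = 3 * b * math.sqrt(c * math.log(c) * T)
    final = 4 * b * math.sqrt(math.log(k) * c * T)
    return within + across, final

# Example: 12 actions split into cliques of sizes 3, 4 and 5.
lhs, rhs = expban_bound_terms([3, 4, 5], b=1.0, T=1000)
```

In this example the left-hand side is smaller than the final bound, as the argument above predicts.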
A.2 Proof of Thm. 2

To prove the theorem, we will need three lemmas. The first one is straightforward and follows from
the definition of $\tilde{g}_j(t)$. The second is a key combinatorial inequality. We were unable to find an
occurrence of this inequality in any previous literature, although we are aware of very special cases
proven in the context of cyclic sums (see for instance [5]). The third lemma allows us to derive a
more explicit bound by examining a particular choice of $\{s_j(t)\}_{j \in [k], t \in [T]}$.
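Lemma 2. For all fixed $t, j$, we have

\[ \mathbb{E}[\tilde{g}_j(t)] = g_j(t), \]

as well as

\[ \mathbb{E}\left[ \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t)^2 \right] \le b^2 \sum_{j=1}^{k} \frac{p_j(t)}{\sum_{l \in N_j(t)} p_l(t)}. \]

Proof. It holds that

\[ \mathbb{E}[\tilde{g}_j(t)] = \sum_{i=1}^{k} p_i(t)\, \mathbb{E}[\tilde{g}_j(t) \mid \text{action } i \text{ was picked}] = \sum_{i \in N_j(t)} p_i(t) \frac{g_j(t)}{\sum_{l \in N_j(t)} p_l(t)} = g_j(t). \]

As to the second part, we have

\[ \mathbb{E}\left[ \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t)^2 \right] = \sum_{i=1}^{k} \sum_{j=1}^{k} p_i(t) p_j(t)\, \mathbb{E}[\tilde{g}_j(t)^2 \mid \text{action } i \text{ was picked}] \]
\[ \le \sum_{j=1}^{k} \sum_{i \in N_j(t)} p_i(t) p_j(t) \frac{b^2}{\left( \sum_{l \in N_j(t)} p_l(t) \right)^2} = b^2 \sum_{j=1}^{k} \frac{p_j(t)}{\sum_{l \in N_j(t)} p_l(t)}. \]

□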
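Both parts of Lemma 2 can be verified exactly on a small example by enumerating which action is picked. The sketch below assumes the estimator takes the form used in the proof, namely $g_j(t) / \sum_{l \in N_j(t)} p_l(t)$ when the picked action lies in $N_j(t)$ and 0 otherwise; the function name and the data layout (a list of neighbor sets) are our own.

```python
def estimator_moments(neighbors, p, g):
    """Exact moments of the importance-weighted reward estimate.

    neighbors[j] is the set N_j (including j) of an undirected graph,
    p is the sampling distribution and g holds the true rewards.
    Returns (means, second_moment_sum, ratio_sum), where ratio_sum is
    sum_j p_j / sum_{l in N_j} p_l.
    """
    k = len(p)
    # q[j]: probability that the reward of action j is observed.
    q = [sum(p[l] for l in neighbors[j]) for j in range(k)]
    means = [0.0] * k
    second = 0.0
    for i in range(k):                 # condition on action i being picked
        for j in range(k):
            est = g[j] / q[j] if i in neighbors[j] else 0.0
            means[j] += p[i] * est
            second += p[i] * p[j] * est ** 2
    ratio_sum = sum(p[j] / q[j] for j in range(k))
    return means, second, ratio_sum
```

On a path graph with three nodes, the returned means coincide with the true rewards, and the second-moment sum is at most $b^2$ times the ratio sum, where $b$ bounds the rewards.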



Lemma 3. Let $G$ be a graph over $k$ nodes, and let $\alpha(G)$ denote the independence number of $G$ (i.e.,
the size of its largest independent set). For any $j \in [k]$, define $N_j$ to be the nodes adjacent to node $j$
(including node $j$). Let $p_1, \ldots, p_k$ be arbitrary positive weights assigned to the nodes. Then it holds
that

\[ \sum_{i=1}^{k} \frac{p_i}{\sum_{l \in N_i} p_l} \le \alpha(G). \]

Proof. We will actually prove the claim for any nonnegative weights $p_1, \ldots, p_k$ (i.e., they are
allowed to take 0 values), under the convention that if $p_i = 0$ and $\sum_{l \in N_i} p_l = 0$ as well, then
$p_i / \sum_{l \in N_i} p_l = 0$.

Suppose on the contrary that there exist some values for $p_1, \ldots, p_k$ such that
$\sum_{i=1}^{k} p_i / \sum_{l \in N_i} p_l > \alpha(G)$. Now, if $p_1, \ldots, p_k$ are non-zero only on an independent set $S$, then

\[ \sum_{i=1}^{k} \frac{p_i}{\sum_{l \in N_i} p_l} = \sum_{i \in S} \frac{p_i}{p_i} = |S|. \]

Since $|S| \le \alpha(G)$, the weights cannot be supported on an independent set, so there exist some adjacent
nodes $r, s$ such that $p_r, p_s > 0$. However, we will show that in that case, we can only increase the value of
$\sum_{i=1}^{k} p_i / \sum_{l \in N_i} p_l$ by shifting the entire weight $p_r + p_s$ to either node $r$ or node $s$, and putting
0 weight at the other node. By repeating this process, we are guaranteed to eventually arrive at a
configuration where the weights are non-zero on an independent set. But we've shown above that in that case,
$\sum_{i=1}^{k} p_i / \sum_{l \in N_i} p_l \le \alpha(G)$, so this means the value of this expression with respect to the
original configuration was at most $\alpha(G)$ as well.

To show this, let us fix $p_r + p_s = c$ (so that $p_s = c - p_r$) and consider how the value of the expression
changes as we vary $p_r$. The sum in the expression $\sum_{i=1}^{k} p_i / \sum_{l \in N_i} p_l$ can be split into 6 parts: when
$i = r$, when $i = s$, when $i$ is a node adjacent to $s$ but not to $r$, when $i$ is adjacent to $r$ but not to $s$,
when $i$ is adjacent to both, and when $i$ is adjacent to neither of them. Decomposing the sum in this
way, so that $p_r$ appears everywhere explicitly, we get

\[ \frac{p_r}{c + \sum_{l \in N_r \setminus \{r,s\}} p_l} + \frac{c - p_r}{c + \sum_{l \in N_s \setminus \{r,s\}} p_l} + \sum_{i : \{r,s\} \cap N_i = \{s\}} \frac{p_i}{c - p_r + \sum_{l \in N_i \setminus \{s\}} p_l} \]
\[ + \sum_{i : \{r,s\} \cap N_i = \{r\}} \frac{p_i}{p_r + \sum_{l \in N_i \setminus \{r\}} p_l} + \sum_{i : i \notin \{r,s\},\, \{r,s\} \subseteq N_i} \frac{p_i}{c + \sum_{l \in N_i \setminus \{r,s\}} p_l} + \sum_{i : \{r,s\} \cap N_i = \emptyset} \frac{p_i}{\sum_{l \in N_i} p_l}. \]

It is readily seen that each of the 6 elements in the sum above is convex in $p_r$. This implies that the
maximum of this expression is attained at the extremes, namely either $p_r = 0$ (hence $p_s = c$) or
$p_r = c$ (hence $p_s = 0$). This proves that indeed shifting weights between adjacent nodes can only
increase the value of $\sum_{i=1}^{k} p_i / \sum_{l \in N_i} p_l$, and as discussed earlier, implies the result stated in the
lemma. □
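The inequality of Lemma 3 is easy to probe numerically on small graphs. The sketch below computes the independence number by brute force (feasible only for a handful of nodes) and evaluates the ratio sum for random positive weights; all function names are our own, and the 5-cycle in the usage line is just a convenient test graph.

```python
import itertools
import random

def independence_number(neighbors):
    """Brute-force size of a largest independent set; neighbors[j] includes j."""
    k = len(neighbors)
    for size in range(k, 0, -1):
        for subset in itertools.combinations(range(k), size):
            if all(v not in neighbors[u]
                   for u, v in itertools.combinations(subset, 2)):
                return size
    return 0

def ratio_sum(neighbors, p):
    """Evaluate sum_i p_i / sum_{l in N_i} p_l."""
    return sum(p[i] / sum(p[l] for l in neighbors[i]) for i in range(len(p)))

def max_ratio_over_random_weights(neighbors, trials=200, seed=0):
    """Largest ratio sum seen over random strictly positive weights."""
    rng = random.Random(seed)
    k = len(neighbors)
    return max(ratio_sum(neighbors, [rng.uniform(0.01, 1.0) for _ in range(k)])
               for _ in range(trials))

# Usage on the 5-cycle, whose independence number is 2.
cycle = [{i, (i + 1) % 5, (i - 1) % 5} for i in range(5)]
```

A random search of this kind can only fail to find a violation; it does not replace the convexity argument above.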



Lemma 4. Consider a graph $G$ over nodes $1, \ldots, k$, and let $\alpha(G)$ be its independence number. For
any $j \in [k]$, define $N_j$ to be the nodes adjacent to node $j$ (including node $j$). Then there exist values
of $s_1, \ldots, s_k$ on the $k$-simplex, such that

\[ \frac{1}{\min_{i \in [k]} \sum_{l \in N_i} s_l} \le \alpha(G). \tag{8} \]

Proof. Let $S$ be a largest independent set of $G$, so that $|S| = \alpha(G)$. Consider the following specific
choice for the values of $s_1, \ldots, s_k$: for any $j$ such that $j \in S$, let $s_j = 1/\alpha(G)$, and $s_j = 0$
otherwise. Suppose there was some node $j$ such that $\sum_{l \in N_j} s_l = 0$. By the way we chose values
for $s_1, \ldots, s_k$, this implies that node $j$ is not adjacent to any node in $S$, so $S \cup \{j\}$ would also be
an independent set, contradicting the assumption that $S$ is a largest independent set. But since each
value of $s_l$ is either 0 or $1/\alpha(G)$, it follows that $\sum_{l \in N_j} s_l \ge 1/\alpha(G)$. This is true for any node $j$,
from which Eq. (8) follows. □
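The construction in the proof of Lemma 4 can be made concrete as follows. This is a brute-force sketch for small graphs only; the helper names are hypothetical, and the largest independent set is found by exhaustive search rather than by any efficient method.

```python
import itertools

def largest_independent_set(neighbors):
    """Return one largest independent set by exhaustive search."""
    k = len(neighbors)
    for size in range(k, 0, -1):
        for subset in itertools.combinations(range(k), size):
            if all(v not in neighbors[u]
                   for u, v in itertools.combinations(subset, 2)):
                return set(subset)
    return set()

def lemma4_weights(neighbors):
    """Put mass 1/alpha on each node of a largest independent set S.

    Returns (s, alpha, worst), where worst = min_j sum_{l in N_j} s_l.
    """
    k = len(neighbors)
    indep = largest_independent_set(neighbors)
    alpha = len(indep)
    s = [1.0 / alpha if j in indep else 0.0 for j in range(k)]
    worst = min(sum(s[l] for l in neighbors[j]) for j in range(k))
    return s, alpha, worst
```

On the 5-cycle this yields $\alpha = 2$ and a worst-case neighborhood mass of $1/2$, so Eq. (8) holds with equality.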



We now turn to the proof of the theorem itself. 



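Proof of Thm. 2. With the key lemmas at hand, most of the remaining proof is rather similar to
the standard analysis for multi-armed bandits (e.g., [4]). We define the potential function
$W_t = \sum_{j=1}^{k} w_j(t)$, and get that

\[ \frac{W_{t+1}}{W_t} = \sum_{j=1}^{k} \frac{w_j(t)}{W_t} \exp(\beta \tilde{g}_j(t)). \tag{9} \]

We have that $\beta \tilde{g}_j(t) \le 1$, since by definition of $\beta$ and $\tilde{g}_j(t)$,

\[ \beta \tilde{g}_j(t) \le \frac{\beta b}{\sum_{l \in N_j(t)} p_l(t)} \le \frac{\beta b}{\sum_{l \in N_j(t)} \gamma(t) s_l(t)} = \frac{\beta b \, \min_{i \in [k]} \sum_{l \in N_i(t)} s_l(t)}{\sum_{l \in N_j(t)} s_l(t) \, \beta b} \le 1. \]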
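Using the definition of $p_j(t)$ and the inequality $\exp(x) \le 1 + x + x^2$ for any $x \le 1$, we can upper
bound Eq. (9) by

\[ 1 + \frac{\beta}{1 - \gamma(t)} \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t) + \frac{\beta^2}{1 - \gamma(t)} \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t)^2. \]

Taking logarithms and using the fact that $\log(1 + x) \le x$, we get

\[ \log\left(\frac{W_{t+1}}{W_t}\right) \le \frac{\beta}{1 - \gamma(t)} \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t) + \frac{\beta^2}{1 - \gamma(t)} \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t)^2. \]

Summing over all $t$, and canceling the resulting telescopic series, we get

\[ \log\left(\frac{W_{T+1}}{W_1}\right) \le \sum_{t=1}^{T} \frac{\beta}{1 - \gamma(t)} \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t) + \sum_{t=1}^{T} \frac{\beta^2}{1 - \gamma(t)} \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t)^2. \tag{10} \]

Also, for any fixed action $i$, we have

\[ \log\left(\frac{W_{T+1}}{W_1}\right) \ge \log\left(\frac{w_i(T+1)}{W_1}\right) = \beta \sum_{t=1}^{T} \tilde{g}_i(t) - \log(k). \tag{11} \]

Combining Eq. (10) with Eq. (11) and rearranging, we get

\[ \sum_{t=1}^{T} \tilde{g}_i(t) - \sum_{t=1}^{T} \frac{1}{1 - \gamma(t)} \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t) \le \frac{\log(k)}{\beta} + \beta \sum_{t=1}^{T} \frac{1}{1 - \gamma(t)} \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t)^2. \]

Taking expectations on both sides, and using Lemma 2, we get

\[ \sum_{t=1}^{T} g_i(t) - \sum_{t=1}^{T} \frac{1}{1 - \gamma(t)} \sum_{j=1}^{k} p_j(t) g_j(t) \le \frac{\log(k)}{\beta} + \beta b^2 \sum_{t=1}^{T} \frac{1}{1 - \gamma(t)} \sum_{j=1}^{k} \frac{p_j(t)}{\sum_{l \in N_j(t)} p_l(t)}. \]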

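After some slight manipulations, and using the fact that $g_j(t) \in [0, 1]$ for all $j, t$, we get

\[ \sum_{t=1}^{T} g_i(t) - \sum_{t=1}^{T} \sum_{j=1}^{k} p_j(t) g_j(t) \le \frac{\log(k)}{\beta} + \sum_{t=1}^{T} \frac{\gamma(t)}{1 - \gamma(t)} + \beta b^2 \sum_{t=1}^{T} \frac{1}{1 - \gamma(t)} \sum_{j=1}^{k} \frac{p_j(t)}{\sum_{l \in N_j(t)} p_l(t)}. \]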

We note that $1/(1 - \gamma(t))$ can be upper bounded by 2, since by definition of $s_i(t)$,

\[ \gamma(t) = \frac{\beta b}{\max_{s_1, \ldots, s_k} \min_{i \in [k]} \sum_{l \in N_i(t)} s_l} \le \frac{\beta b}{\min_{j \in [k]} \sum_{l \in N_j(t)} (1/k)} \le \beta b k \le 1/2. \]

Plugging this in as well as our choice of $\gamma(t)$ in the $\sum_t \gamma(t)$ term, and slightly simplifying, we get
the upper bound

x>(t)-f> it( t)] < & (j:-. — ^ — m +2 E v Mt) J +T- 

(12) 

Now, we recall that the $\{s_i(t)\}$ terms were chosen so as to minimize the bound above. Thus, we can
upper bound it by any fixed choice of $\{s_i(t)\}$. Invoking Lemma 4 as well as Lemma 3, the theorem
follows. □
A.3 Proof of Thm. 3



The proof is very similar to the one of Thm. 2, so we'll only point out the differences.

Referring to the proof of Thm. 2 in Subsection A.2, the analysis is identical up to Eq. (12). To upper
bound the terms there, we can still invoke Lemma 4. However, Lemma 3, which was used
to upper bound $\sum_{j=1}^{k} p_j(t) / \sum_{l \in N_j(t)} p_l(t)$, no longer applies (in fact, one can show specific
counter-examples). Thus, in lieu of Lemma 3, we will opt for the following weaker bound: Let
$C_1, \ldots, C_{\bar{\chi}(G_t)}$ be a smallest possible clique partition of $G_t$. Then we have

\[ \sum_{j=1}^{k} \frac{p_j(t)}{\sum_{l \in N_j(t)} p_l(t)} = \sum_{i=1}^{\bar{\chi}(G_t)} \sum_{j \in C_i} \frac{p_j(t)}{\sum_{l \in N_j(t)} p_l(t)} \le \sum_{i=1}^{\bar{\chi}(G_t)} \sum_{j \in C_i} \frac{p_j(t)}{\sum_{l \in C_i} p_l(t)} = \bar{\chi}(G_t). \]

Plugging this upper bound as well as Lemma 4 into Eq. (12), and using the fact that $\alpha(G_t) \le \bar{\chi}(G_t)$
for any graph $G_t$, the result follows.
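The clique-partition bound can be checked on a toy directed example. In the sketch below, `neighbors[j]` is assumed to contain every node whose selection reveals the reward of $j$ (including $j$ itself), and each clique of the partition is assumed to be contained in the neighbor set of each of its members; the function name and the example graph are hypothetical.

```python
def clique_partition_terms(neighbors, cliques, p):
    """Return (sum_j p_j / sum_{l in N_j} p_l, clique-based upper bound).

    The upper bound replaces each neighborhood N_j by the clique of the
    partition that contains j, so it equals the number of cliques.
    """
    lhs = sum(p[j] / sum(p[l] for l in neighbors[j]) for j in range(len(p)))
    upper = sum(p[j] / sum(p[l] for l in clique)
                for clique in cliques for j in clique)
    return lhs, upper

# Two cliques {0, 1} and {2, 3}; in addition, picking node 0 reveals node 2.
neighbors = [{0, 1}, {0, 1}, {0, 2, 3}, {2, 3}]
cliques = [[0, 1], [2, 3]]
lhs, upper = clique_partition_terms(neighbors, cliques, [0.1, 0.2, 0.3, 0.4])
```

Here the upper bound equals the number of cliques, 2, and the ratio sum stays below it.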



A.4 Proof of Theorem 4

Suppose that we are given a graph $G$ with an independence number $\alpha(G)$. Let $\mathcal{N}$ denote an
independent set of $\alpha(G)$ nodes (i.e., no two nodes are connected). Suppose we have an algorithm $A$
with a low expected regret for every sequence of rewards. We will use this algorithm to form an
algorithm for the standard multi-armed bandits problem (with no side-observations). We will then
resort to the known lower bound for this problem, to get a lower bound for our setting as well.

Consider first a standard multi-armed bandits game on $\alpha(G)$ actions (with no side-observations),
with the following randomized strategy for the adversary: the adversary picks one of the $\alpha(G)$
actions uniformly at random, and at each round, assigns it a random Bernoulli reward with parameter
$1/2 + \epsilon$ (where $\epsilon$ will be specified later). The other actions are assigned a random Bernoulli reward
with parameter 1/2. Roughly speaking, Theorem 6.11 of [6] shows that with this strategy and for
$\epsilon = \Theta(\sqrt{\alpha(G)/T})$, the expected regret of any learning algorithm is at least $\Omega(\sqrt{\alpha(G)T})$.

Now, suppose that for the setting with side-observations, played over the graph $G$, there exists a
learning strategy $A$ that achieves expected cumulative regret of at most $R_A(T)$, for the graph $G$ over
$T$ rounds, with respect to any adversary strategy. We will now show how to use $A$ for the standard
multi-armed bandits game described above. To that end, arbitrarily assign the $\alpha(G)$ actions to the
$\alpha(G)$ independent nodes in $\mathcal{N}$. We will then implement the following strategy $A'$: whenever $A$
chooses one of the actions in $\mathcal{N}$, we choose the corresponding action in the multi-armed bandits
problem and feed the reward back to $A$ (the reward of all neighboring nodes is 0, which we feed
back to $A$ as well). Whenever $A$ chooses a node $j$ not in $\mathcal{N}$, we use the next $|N_j \cap \mathcal{N}|$ rounds (where
$N_j$ is the neighborhood set of $j$) to do "pure exploration": we go over all the neighbors of node $j$
that belong to $\mathcal{N}$ in some fixed order, and choose each of them once (since rewards are assumed
stochastic, the order does not matter). Nodes in $N_j \setminus \mathcal{N}$ are known to yield a reward of 0. The
rewards of node $j$ and all its neighbors are then fed to $A$, as if they were side observations obtained
in a single round by choosing a node not in $\mathcal{N}$. Since the rewards are chosen i.i.d., the distribution of
these rewards is identical to the case where $A$ was really implemented with side-observations. We
denote by $R_{A'}(T)$ the expected regret of this strategy $A'$ after $T$ rounds.
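The strategy $A'$ can be sketched as a thin wrapper around $A$. The interface below is entirely hypothetical: `choose_node` and `feed_back` stand for whatever mechanism $A$ uses to pick a node and receive observations, `pull_arm` stands for one round of the bandit game, and `neighbors[j]` is the set $N_j$ (including $j$). The sketch only illustrates the bookkeeping of the reduction.

```python
def run_reduction(choose_node, feed_back, neighbors, indep, pull_arm, T):
    """Simulate A' until A has been called T times; return the rounds used (T')."""
    indep = set(indep)
    rounds_used = 0
    for _ in range(T):                        # A is called exactly T times
        j = choose_node()
        # Every node outside the independent set yields reward 0.
        obs = {l: 0.0 for l in neighbors[j]}
        if j in indep:
            obs[j] = pull_arm(j)              # one real bandit round
            rounds_used += 1
        else:
            # Pure exploration: pull each neighbor of j in the independent set.
            for l in sorted(neighbors[j] & indep):
                obs[l] = pull_arm(l)
                rounds_used += 1
        feed_back(j, obs)                     # presented to A as one round
    return rounds_used
```

For a path on three nodes with independent set {0, 2}, an algorithm that cycles through nodes 0, 1, 2 twice uses 8 bandit rounds for its 6 calls, since each choice of the middle node costs two exploration pulls.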

We make the following observation: suppose $A$ achieves an expected regret satisfying

\[ R_A(T) \le \sqrt{\alpha(G)T} \]

(we can assume this since our goal is to provide a lower bound which will only be smaller). Then
the number of times $A$ chose actions outside $\mathcal{N}$ must be smaller than $2\sqrt{\alpha(G)T}$. This is because
whenever $A$ chooses an action not in $\mathcal{N}$ it receives a reward of 0, while the highest expected reward
is bigger than 1/2, so the expected per-round regret would increase by at least 1/2.

We apply $A'$ at each round, till $A$ is called $T$ times. Let $T'$ be the (possibly random) number of
rounds which elapsed. It holds that $T' \ge T$, since we have the $T' - T$ pure exploration rounds
where $A$ is not called. In these exploration rounds, we pull arms in $\mathcal{N}$, so our expected regret in
each of those rounds is at most $\epsilon$. Moreover, by the observation above, the number of such rounds is at
most $2\alpha(G)\sqrt{\alpha(G)T}$, since $A$ may choose an action outside $\mathcal{N}$ at most $2\sqrt{\alpha(G)T}$ times, and each
such choice is followed by at most $|\mathcal{N}| = \alpha(G)$ pure exploration steps. In rounds where we do not do
exploration steps, the expected per-round regret of $A'$ is the same as the expected per-round regret of $A$. Overall,
this implies that

\[ R_{A'}(T') \le R_A(T) + 2\epsilon\,\alpha(G)\sqrt{\alpha(G)T}. \tag{13} \]

Since the expected regret is monotone in the number of rounds, we can lower bound $R_{A'}(T')$ by
$R_{A'}(T)$. Rearranging, we get

\[ R_A(T) \ge R_{A'}(T) - 2\epsilon\,\alpha(G)\sqrt{\alpha(G)T}. \]

Now, $A'$ is a strategy for the standard multi-armed bandits setting, with a randomized adversary
strategy which is identical to the one used to establish the lower bound of [6, Theorem 6.11]. Using
this lower bound, by selecting $\epsilon = \sqrt{c_1 \alpha(G)/T}$ with $c_1 = 1/(8 \ln(4/3))$, we obtain

\[ R_A(T) \ge c_2\sqrt{T\alpha(G)} - 2\sqrt{c_1}\,\alpha(G)^2, \tag{14} \]

where the first term of the right hand side comes from page 168 in [6] and

\[ c_2 = \frac{\sqrt{2} - 1}{\sqrt{32 \ln(4/3)}}. \]

Since $T \ge 16\alpha(G)^3 c_1 / c_2^2$, we have that $R_A(T) \ge c_2\sqrt{T\alpha(G)}/2$. Plugging in the values of $c_1, c_2$
above, the result follows.
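The constants in this last step are easy to evaluate numerically. The sketch below computes $c_1$, $c_2$, the choice of $\epsilon$, and the smallest horizon $16\alpha(G)^3 c_1/c_2^2$ for which the argument applies; the function names are our own, and the example value $\alpha(G) = 3$ is arbitrary.

```python
import math

C1 = 1.0 / (8.0 * math.log(4.0 / 3.0))
C2 = (math.sqrt(2.0) - 1.0) / math.sqrt(32.0 * math.log(4.0 / 3.0))

def min_horizon(alpha):
    """Smallest T covered by the argument: 16 * alpha^3 * c1 / c2^2."""
    return 16.0 * alpha ** 3 * C1 / C2 ** 2

def lower_bound_terms(alpha, T):
    """Return (epsilon, right-hand side of Eq. (14), c2 * sqrt(T * alpha) / 2)."""
    eps = math.sqrt(C1 * alpha / T)
    main = C2 * math.sqrt(T * alpha)
    # Exploration loss 2 * eps * alpha * sqrt(alpha * T) = 2 * sqrt(c1) * alpha^2.
    loss = 2.0 * eps * alpha * math.sqrt(alpha * T)
    return eps, main - loss, main / 2.0

# Example with alpha(G) = 3 at the smallest admissible integer horizon.
alpha = 3
T = math.ceil(min_horizon(alpha))
eps, bound, half = lower_bound_terms(alpha, T)
```

At this horizon the right-hand side of Eq. (14) is at least half of its leading term, and $\epsilon$ is well below 1/2, so the Bernoulli parameter $1/2 + \epsilon$ is valid.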



Finally, we note that if the maximal degree of any node in $G$ is bounded by $d$, then Eq. (13) can be
improved to

\[ R_{A'}(T') \le R_A(T) + 2\epsilon\, d\sqrt{\alpha(G)T}, \]

since the number of pure-exploration steps following a call to $A$ is at most $d$ rather than $\alpha(G)$.
Repeating the analysis above, we get that Eq. (14) is replaced by

\[ R_A(T) \ge c_2\sqrt{T\alpha(G)} - 2\sqrt{c_1}\, d\,\alpha(G). \]

This allows us to give the same lower bound, for any $T \ge 16\alpha(G) d^2 c_1 / c_2^2$, as opposed to
$T \ge 16\alpha(G)^3 c_1 / c_2^2$ as before.